(57) ABSTRACT 

A Web crawler application takes advantage of a document 
store's ability to provide a content identifier (CID) having a 
value that is a unique function of the physical storage 
location of a data object or document, such as a Web page. 
In operation, the crawler first tries to fetch the CID for a 
document. If the CID attribute is not supported by the 
document store, the crawler fetches the document, filters it 
to obtain a hash function, and commits the document to an 
index if the hash function is not present in a history table. If 
the CID is available from the document store, the CID is 
fetched from the document store. The crawler then deter- 
mines whether the CID is present in the history table, which 
indicates whether an identical copy of the document in 
question has already been indexed under a different URL. If 
the CID is present, indicating that the document has already 
been indexed, the new URL is placed in the history file but 
the document itself is not retrieved from the document store, 
nor is it filtered again to obtain a CID. If the CID is not 
present in the history table, the full document is retrieved 
and indexed. The CID data structure is an extension of a 
known globally unique ID (GUID). Whereas the GUID is a 
16-byte number, the CID comprises a 16-byte GUID plus an 
additional 6-byte number. 

22 Claims, 3 Drawing Sheets 
METHOD AND SYSTEM FOR DETECTING 
DUPLICATE DOCUMENTS IN WEB 
CRAWLS 

CROSS REFERENCE TO RELATED 
APPLICATIONS 

The present invention is related to the subject matter of 
co-pending application Ser. No. 09/345,040, filed on even 
date herewith, entitled "Method and System for Incremental 
Web Crawling," which is hereby incorporated by reference. 

TECHNICAL FIELD 

The present invention relates generally to the fields of 
computerized publishing and knowledge management, and 
more particularly to Web crawler applications used, e.g., by 
Internet search engines. The invention, however, is not 
limited to use in a Web crawler. On the contrary, the 
invention could be used in a mail server, directory service, 
or any system requiring indexing or one-way replication of 
a document store. 

BACKGROUND OF THE INVENTION 

There has recently been a tremendous growth in the 
number of computers connected to the Internet. A client 
computer connected to the Internet can download digital 
information from server computers. Client application soft- 
ware typically accepts commands from a user and obtains 
data and services by sending requests to server applications 
running on the server computers. A number of protocols are 
used to exchange commands and data between computers 
connected to the Internet. The protocols include the File 
Transfer Protocol (FTP), the Hyper Text Transfer Protocol 
(HTTP), the Simple Mail Transfer Protocol (SMTP), and the 
Gopher document protocol. 

The HTTP protocol is used to access data on the World 
Wide Web, often referred to as "the Web." The Web is an 
information service on the Internet providing documents and 
links between documents. It is made up of numerous Web 
sites located around the world that maintain and distribute 
electronic documents. A Web site may use one or more Web 
server computers that store and distribute documents in a 
number of formats, including the Hyper Text Markup Lan- 
guage (HTML). An HTML document contains text and 
metadata (commands providing formatting information), as 
well as embedded links that reference other data or docu- 
ments. The referenced documents may represent text, 
graphics, or video. 

A Web browser is a client application or, preferably, an 
integrated operating system utility that communicates with 
server computers via FTP, HTTP and Gopher protocols. Web 
browsers receive electronic documents from the network 
and present them to a user. 

An intranet is a local area network containing Web servers 
and client computers operating in a manner similar to the 
World Wide Web described above. Typically, all of the 
computers on an intranet are contained within a company or 
organization. 

The term "search engine" is often used generically to 
describe both true search engines and directories, although 
they are not the same. Search engines typically create their 
listings automatically by "crawling" the Web. A directory, on 
the other hand, depends on humans for its listings, i.e., a 
person submits a short description for an entire site or editors 
write a description for sites they review. The present invention 
is particularly suited (although not necessarily limited) 
for use in a search engine of the type that gathers information 
automatically, i.e., by "crawling" the Web. 

Search engines typically include a "crawler" (also called 
a "spider" or "bot") that visits a Web page, reads it, and then 
follows links to other pages within the site. The crawler 
returns to the site on a regular basis to look for changes. 
Everything the crawler finds goes into an index, which is 
another part of the search engine. The index is like a file or 
container holding a copy of every Web page that the crawler 
finds. If a Web page changes, then the index is updated with 
new information. The search engine software, which is yet 
another part of the search engine, is a program that sifts 
through the pages recorded in the index to find documents 
fulfilling a search query submitted by a user. The search 
engine software will typically rank the matches in accor- 
dance with their relevance. 

Once it is given a set of start addresses and restriction 
rules, a crawler can retrieve documents following all recur- 
sive links from the documents that correspond to the start 
addresses that pass the restriction rules. The primary appli- 
cation of the crawler is to build an index of a set of 
documents, so that the index can be searched by end-users 
that want to locate documents that match certain search 
criteria. 

A crawler can retrieve documents from different stores. 
Although the primary store is the Web, a crawler can retrieve 
documents from a mail store, database, or anything else that 
has textual content. 

A shortcoming of Web crawlers and other applications 
required to access documents stored in one or more document 
stores is that resources are wasted on retrieving the 
documents from the store in order to determine whether the 
same document has already been processed or indexed. For 
example, a document must be fetched from a document store 
and filtered to obtain a hash function, and then the hash 
function must be compared to the hash functions of previ- 
ously processed documents to determine whether the new 
document is a replica of another document already repre- 
sented in the index. There is a need for an improved method 
and system for identifying duplicate documents, and using 
this information to avoid unnecessarily retrieving and pro- 
cessing such duplicates. The present invention achieves this 
goal. 

Further background information about Web crawlers is 
provided below, and may also be found in U.S. patent 
application Ser. No. 09/105,758, filed Jun. 26, 1998, 
"Method of Web Crawling Utilizing Crawl Numbers," and 
U.S. patent application Ser. No. 09/107,227, filed Jun. 30, 
1998, "Synchronizing Crawler With Notification Source." 

SUMMARY OF THE INVENTION 

The present invention provides an improved way to 
access documents (including Web pages, file system 
documents, e-mail messages, etc.) stored in one or more 
document stores on a computer network. For example, the 
invention could be used in a Web crawler application, mail 
server, directory service, or any system requiring indexing or 
one-way replication of a document store. The invention is 
particularly directed to a method and system for identifying 
duplicate documents in a document store, and using this 
information to avoid unnecessarily retrieving and processing 
such duplicates. 

A Web crawler application in accordance with the present 
invention takes advantage of a document store's ability to 
provide a content identifier (CID) having a value that is 
either a unique function of the physical storage location of 
a data object or document, such as a Web page, or, 
alternatively, a unique function of the content of the document 
(i.e., identical documents stored in different locations 
would have equal CIDs). According to the invention, the 
crawler first tries to fetch the CID for a document. If the CID 
attribute is not supported by the document store, the crawler 
processes the document in accordance with a prior method, 
e.g., by fetching the document, filtering it to obtain a hash 
function, and committing the document to an index if the 
hash function is not present in a History Table (or a separate 
table associated with the History Table). On the other hand, 
if the CID is available from the document store, it is fetched 
by the crawler. The crawler then determines whether the 
CID is present in the History Table, which indicates whether 
the document in question has already been indexed under a 
different URL. If the CID is present, indicating that the 
document has already been indexed, the new URL is placed 
in the History Table but the document itself is not retrieved 
from the document store, nor is it filtered again to obtain a 
CID. If the CID is not present in the History Table or 
separate CID table, the full document is retrieved and 
indexed. 
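
By way of illustration only, the decision flow summarized above 
may be sketched as follows. This is a minimal sketch under 
stated assumptions and is not the implementation of the 
invention: the names HistoryTable, crawl_one, fetch_cid and 
fetch_document are hypothetical, the History Table is modeled 
as an in-memory dictionary, a store without CID support is 
modeled by fetch_cid returning None, and the filtering step of 
the prior method is omitted so that the MD5 digest is computed 
over the raw document bytes. 

```python
import hashlib
from typing import Callable, Dict, List, Optional


class HistoryTable:
    """Hypothetical in-memory stand-in for the persistent History Table."""

    def __init__(self) -> None:
        # Maps a CID (or an MD5 digest in the fallback path) to every URL
        # under which that document has been seen.
        self.urls_by_key: Dict[bytes, List[str]] = {}

    def seen(self, key: bytes) -> bool:
        return key in self.urls_by_key

    def record(self, key: bytes, url: str) -> None:
        self.urls_by_key.setdefault(key, []).append(url)


def crawl_one(
    url: str,
    fetch_cid: Callable[[str], Optional[bytes]],
    fetch_document: Callable[[str], bytes],
    history: HistoryTable,
    index: Dict[str, bytes],
) -> str:
    """Process one URL; return "indexed" or "duplicate"."""
    cid = fetch_cid(url)  # None models a store without CID support
    if cid is not None:
        if history.seen(cid):
            # Duplicate: remember the new URL, but never fetch the document.
            history.record(cid, url)
            return "duplicate"
        body = fetch_document(url)
        history.record(cid, url)
        index[url] = body
        return "indexed"

    # Fallback to the prior method: fetch first, then compare hash values.
    body = fetch_document(url)
    digest = hashlib.md5(body).digest()
    duplicate = history.seen(digest)
    history.record(digest, url)
    if duplicate:
        return "duplicate"
    index[url] = body
    return "indexed"
```

In the CID path the document body is retrieved only for the 
first URL that maps to a given CID, whereas the fallback path 
must retrieve every document before a duplicate can be 
recognized. 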

Note that, when the CID is a function of the physical 
location of the document, as in the exemplary implementation 
described below, it does not achieve better duplicate 
detection if the duplicate documents are located in different 
stores (e.g., different Web sites). However, it does solve the 
problem of locating duplicates within the same site, which 
is a very relevant problem for sites with multiple virtual 
directories, or mail stores. On the other hand, the present 
invention could be implemented such that duplicates at 
different storage locations (e.g., where a document is copied 
to another location and not changed) would have equal CIDs 
and thus would be identifiable as duplicates based on the 
CID property. Thus, for example, in the latter embodiment 
a unique CID would be generated whenever a document is 
modified and stored. If this document is copied elsewhere, 
but remains unmodified such that it keeps the same CID, 
then the present invention can be used to detect that duplicates 
are stored at different locations. 

Preferably, the CID data structure will be an extension of 
a known globally unique ID (GUID). For example, whereas 
the GUID is a 16-byte number, the CID of the present 
invention may comprise a 16-byte GUID plus an additional 
6-byte number. 
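
For illustration, a CID of this shape (a 16-byte GUID followed 
by a 6-byte number, 22 bytes in all) might be represented as 
sketched below. The class name ContentId, the field names, and 
the big-endian byte order of the 6-byte number are assumptions 
made for the example and are not specified in this document. 

```python
import uuid
from dataclasses import dataclass

GUID_BYTES = 16     # size of the GUID portion
COUNTER_BYTES = 6   # size of the additional number


@dataclass(frozen=True)
class ContentId:
    """Hypothetical 22-byte CID: a 16-byte GUID plus a 6-byte number."""

    guid: uuid.UUID
    counter: int

    def to_bytes(self) -> bytes:
        # int.to_bytes raises OverflowError if the counter does not fit
        # in 6 bytes, which guards against silent wrap-around.
        return self.guid.bytes + self.counter.to_bytes(COUNTER_BYTES, "big")

    @classmethod
    def from_bytes(cls, raw: bytes) -> "ContentId":
        if len(raw) != GUID_BYTES + COUNTER_BYTES:
            raise ValueError("a CID must be exactly 22 bytes")
        return cls(
            guid=uuid.UUID(bytes=raw[:GUID_BYTES]),
            counter=int.from_bytes(raw[GUID_BYTES:], "big"),
        )
```

Because the dataclass is frozen, a ContentId is hashable and 
can be used directly as a lookup key in a table of CIDs. 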

Other features of the present invention are described 
below. 

BRIEF DESCRIPTION OF THE DRAWINGS 

The foregoing summary and the following detailed 
description of presently preferred embodiments are better 
understood when read in conjunction with the appended 
drawings, in which: 

FIG. 1 is a block diagram representing a general purpose 
computer system in which aspects of the present invention 
may be incorporated. 

FIG. 2 is a schematic diagram representing a computer 
network in which aspects of the present invention may be 
incorporated. 

FIG. 3 is a flowchart of a method for detecting duplicate 
documents using a content identifier attribute in accordance 
with the present invention. 

DETAILED DESCRIPTION OF PREFERRED 
EMBODIMENTS 

The present invention provides a mechanism for obtaining 
information pertaining to electronic documents that reside 
on one or more server computers. While the following 
discussion describes an embodiment of the invention that 
crawls the Internet within the context of the World Wide 
Web, the present invention is not limited to that use. This 
present invention may also be employed on any type of 
computer network or individual computer having data stores 
for files, e-mail messages, databases and the like. The 
information from all such stores can be processed together 
or separately in accordance with the invention. 

The present invention will now be explained with refer- 
ence to a presently preferred embodiment thereof. An over- 
view of Web crawler methods is provided first. After this 
overview, a description of exemplary computer and network 
environments is provided. Finally, a detailed description of 
the inventive methods of incremental Web crawling, and 
detecting duplicate documents in Web crawls using deleted 
documents counts, is provided. 

Overview of Web Crawler Methods 

A server computer on the Internet is sometimes referred to 
as a "Web site," and the process of locating and retrieving 
digital data from Web sites is sometimes referred to as "Web 
crawling." Web crawling may entail initially performing a 
first full crawl wherein a transaction log is "seeded" with one 
or more document address specifications. (The term address 
specification, address specifier, and URL are used inter- 
changeably in this specification. These terms refer to any 
type of naming convention that may be used to address a file, 
and are not intended to imply that the present invention is 
limited to Internet applications.) Each document listed in the 
transaction log is retrieved from its Web site and processed. 
The processing may include extracting the data from each of 
these retrieved documents and storing that data in an index, 
or other database, with an associated "crawl number modi- 
fied" that is set equal to a unique current crawl number that 
is associated with the first full crawl. A hash value (such as 
MD5) for the document and the document's time stamp may 
also be stored with the document data in the index. The 
document URL, its hash value, its time stamp, and its crawl 
number modified may then be stored in a persistent History 
Table used by the crawler to record documents that have 
been crawled. 
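
As a hedged illustration, the four items stored per document 
in the History Table (URL, hash value, time stamp, and crawl 
number modified) could be modeled as shown below. The names 
HistoryEntry and record_first_crawl are hypothetical, the table 
is modeled as a dictionary keyed by URL, and the MD5 digest is 
taken over the raw bytes rather than over filtered document 
data. 

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass
class HistoryEntry:
    """One hypothetical History Table row for a crawled document."""

    url: str
    md5_hex: str                # hash value of the document data
    time_stamp: float           # the document's time stamp
    crawl_number_modified: int  # crawl in which it was last seen modified


def record_first_crawl(
    history: Dict[str, HistoryEntry],
    url: str,
    content: bytes,
    time_stamp: float,
    crawl_number: int,
) -> HistoryEntry:
    """Store the entry created when a document is first crawled."""
    entry = HistoryEntry(
        url=url,
        md5_hex=hashlib.md5(content).hexdigest(),
        time_stamp=time_stamp,
        crawl_number_modified=crawl_number,
    )
    history[url] = entry
    return entry
```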

Incremental crawls or additional full crawls may be 
performed after the first full crawl. During a full crawl, the 
transaction log is seeded with one or more document address 
specifications, which are used to retrieve the document 
associated with the address specification. The retrieved 
documents are recursively processed to find any linked 
document address specifications contained in the retrieved 
document. The document address specification of the linked 
document is added to the transaction log the first time it is 
found during the current crawl. The full crawl builds a new 
index based on the documents that it retrieves based on the 
seeds in its transaction log and the gathering rules that 
constrain the search. During the course of the full crawl, the 
document address specifications of the retrieved documents 
(for example, the documents' URLs) are compared to associated 
entries in the History Table (if there are any entries). 
URLs that are marked as having been crawled during this 
crawl are ignored. 
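
The full crawl described above can be sketched, under 
simplifying assumptions, as a queue-driven traversal. The names 
full_crawl, fetch_links and allowed are hypothetical; the 
transaction log is modeled as an in-memory queue, the gathering 
rules as a single predicate, and indexing and History Table 
updates are left out. 

```python
from collections import deque
from typing import Callable, Iterable, List


def full_crawl(
    seeds: Iterable[str],
    fetch_links: Callable[[str], Iterable[str]],
    allowed: Callable[[str], bool],
) -> List[str]:
    """Return the URLs visited, in the order they were processed."""
    transaction_log = deque()
    seen = set()  # URLs already logged during the current crawl

    # Seed the transaction log with the start addresses that pass the rules.
    for url in seeds:
        if allowed(url) and url not in seen:
            seen.add(url)
            transaction_log.append(url)

    crawled: List[str] = []
    while transaction_log:
        url = transaction_log.popleft()
        crawled.append(url)
        # A linked address is added only the first time it is found.
        for link in fetch_links(url):
            if link not in seen and allowed(link):
                seen.add(link)
                transaction_log.append(link)
    return crawled
```

The seen set plays the role of the per-crawl marking in the 
History Table: a URL already logged during the current crawl 
is ignored when it is encountered again. 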

An incremental crawl retrieves only documents that may 
have changed since the previous crawl. The incremental 
crawl uses the History Table and its transaction log is seeded 
with the document address specifications (URLs) contained 
in the History Table. In an incremental crawl, a document 
may be retrieved from a Web site if its time stamp is later 
than the time stamp stored in the History Table. This type of 
Web crawl is described in the above-cited U.S. patent 
application Ser. No. 09/105,758 ("Method of Web Crawling 
Utilizing Crawl Numbers"). 

To determine whether a substantive change has been 
made to the document, a Web crawler may filter extraneous 
data from the document (e.g., formatting information) and 
then compute a hash value for the remaining document data. 
The hash value would then be compared to a hash value 
stored in the History Table. Different hash values would 
indicate that the document has changed. If the hash value has 
changed, the document may be marked as modified and its 
crawl number modified may be set to the current crawl 
number (if applicable for the crawler). 

Searches of the index, or database, created by the Web 
crawler can use the crawl number modified as a search 
parameter if a user is only interested in documents that have 
changed, or that have been added, since a previous search. 
In response to a request for only modified documents, the 
intermediate agent would implicitly add a limitation to the 
search that the search return only documents that have a 
crawl number modified that is subsequent to (greater than) 
a stored crawl number associated with a prior search. 

Computer Environment 

Web crawler programs execute on a computer. FIG. 1 and 
the following discussion are intended to provide a brief 
general description of a suitable computing environment in 
which the invention may be implemented. Although not 
required, the invention will be described in the general 
context of computer-executable instructions, such as program 
modules, being executed by a computer, such as a 
client workstation or a server. Generally, program modules 
include routines, programs, objects, components, data structures 
and the like that perform particular tasks or implement 
particular abstract data types. Moreover, those skilled in the 
art will appreciate that the invention may be practiced with 
other computer system configurations, including hand-held 
devices, multi-processor systems, microprocessor-based or 
programmable consumer electronics, network PCs, 
minicomputers, mainframe computers and the like. The 
invention may also be practiced in distributed computing 
environments where tasks are performed by remote processing 
devices that are linked through a communications network. 
In a distributed computing environment, program 
modules may be located in both local and remote memory 
storage devices. 

As shown in FIG. 1, an exemplary general purpose 
computing system includes a conventional personal computer 
20 or the like, including a processing unit 21, a system 
memory 22, and a system bus 23 that couples various system 
components including the system memory to the processing 
unit 21. The system bus 23 may be any of several types of 
bus structures including a memory bus or memory 
controller, a peripheral bus, and a local bus using any of a 
variety of bus architectures. The system memory includes 
read-only memory (ROM) 24 and random access memory 
(RAM) 25. A basic input/output system 26 (BIOS), containing 
the basic routines that help to transfer information 
between elements within the personal computer 20, such as 
during start-up, is stored in ROM 24. The personal computer 
20 may further include a hard disk drive 27 for reading from 
and writing to a hard disk, not shown, a magnetic disk drive 
28 for reading from or writing to a removable magnetic disk 
29, and an optical disk drive 30 for reading from or writing 
to a removable optical disk 31 such as a CD-ROM or other 
optical media. The hard disk drive 27, magnetic disk drive 
28, and optical disk drive 30 are connected to the system bus 
23 by a hard disk drive interface 32, a magnetic disk drive 
interface 33, and an optical drive interface 34, respectively. 
The drives and their associated computer-readable media 
provide non-volatile storage of computer readable 
instructions, data structures, program modules and other 
data for the personal computer 20. Although the exemplary 
environment described herein employs a hard disk, a removable 
magnetic disk 29 and a removable optical disk 31, it 
should be appreciated by those skilled in the art that other 
types of computer readable media which can store data that 
is accessible by a computer, such as magnetic cassettes, flash 
memory cards, digital video disks, Bernoulli cartridges, 
random access memories (RAMs), read-only memories 
(ROMs) and the like may also be used in the exemplary 
operating environment. 

A number of program modules may be stored on the hard 
disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, 
including an operating system 35, one or more application 
programs 36, other program modules 37 and program data. 
A user may enter commands and information into the 
personal computer 20 through input devices such as a 
keyboard 40 and pointing device 42. Other input devices 
may include a microphone, joystick, game pad, 
satellite disk, scanner or the like. These and other input 
devices are often connected to the processing unit 21 
through a serial port interface 46 that is coupled to the 
system bus, but may be connected by other interfaces, such 
as a parallel port, game port or universal serial bus (USB). 
A monitor 47 or other type of display device is also 
connected to the system bus 23 via an interface, such as a 
video adapter. In addition to the monitor 47, personal 
computers typically include other peripheral output devices, 
such as speakers and printers. The exemplary 
system of FIG. 1 also includes a host adapter, a Small 
Computer System Interface (SCSI) bus 56, and an external 
storage device connected to the SCSI bus 56. 

The personal computer 20 may operate in a networked 
environment using logical connections to one or more 
remote computers, such as a remote computer 49. The 
remote computer 49 may be another personal computer, a 
server, a router, a network PC, a peer device or other 
common network node, and typically includes many or all of 
the elements described above relative to the personal computer 
20, although only a memory storage device 50 has 
been illustrated in FIG. 1. The logical connections depicted 
in FIG. 1 include a local area network (LAN) 51 and a wide 
area network (WAN) 52. Such networking environments are 
commonplace in offices, enterprise-wide computer 
networks, intranets and the Internet. 

When used in a LAN networking environment, the personal 
computer 20 is connected to the LAN 51 through a 
network interface or adapter 53. When used in a WAN 
networking environment, the personal computer 20 typically 
includes a modem 54 or other means for establishing 
communications over the wide area network 52, such as the 
Internet. The modem 54, which may be internal or external, 
is connected to the system bus 23 via the serial port interface 
46. In a networked environment, program modules depicted 
relative to the personal computer 20, or portions thereof, 
may be stored in the remote memory storage device. It will 
be appreciated that the network connections shown are 
exemplary and other means of establishing a communications 
link between the computers may be used. 

Network Environment 

As noted, the computer described above can be deployed 
as part of a computer network. In general, the above descrip- 
tion applies to both server computers and client computers 
deployed in a network environment. FIG. 2 illustrates one 
such exemplary network environment in which the present 
invention may be employed. 

As shown, a Web server 100 is interconnected with a 
number of other server computers, such as a database server 
110, a file server 120, and a mail server 130. The Web server 
100 includes a document store 140a. Similarly, the database 
server, file server, and mail server include document stores 
140b, 140c and 140d, respectively. In this example, the Web 
server, database server, file server, and mail server are part 
of a local area network 150. A wide area communications 
network 160 (e.g., the Internet) permits remote Web sites 
170 and client computers 20a, 20b, 20c, etc. (each equipped 
with a browser 35-1), to gain access to Web server 100, e.g., 
to search for documents or other forms of electronically 
stored information. 

The Web server 100 contains a Web crawler program 200, 
which is employed as described above to gather information 
for use in a searchable index. In addition, as shown, the Web 
server contains a search engine 300 and a persistent store 
400 for the index, History Table and log files. The Web 
crawler program 200 searches for electronic documents 
distributed on one or more computers connected to the Web 
server 100, including servers 110, 120 and 130, as well as 
remotely connected Web site(s) 170. Although the network 
150 is shown as a local area network, it may be a WAN or 
a combination of networks that allow the Web server 100 to 
communicate with other computers having associated document 
stores available for indexing. 

The Web crawler program 200 searches its own document 
store 140a and those of remote servers for electronic documents. 
It retrieves documents and associated data. The 
contents of the electronic documents, along with associated 
data, can be used in a variety of ways. For example, the Web 
crawler 200 may pass the information to indexing/search 
engines 300. The indexing engine 300-1 (see FIG. 3) is a 
computer program that maintains an index 400-1 of electronic 
documents. The index is like the index in a book and 
contains reference information and pointers to electronic 
documents to which the reference information applies. For 
example, the index may include keywords and for each 
keyword a list of addresses. Each address can be used to 
locate a document that includes the keyword. The index may 
also include information other than keywords used within 
the electronic documents. For example, the index may 
include subject headings or category names, even when the 
literal subject heading or category name is not included 
within the electronic document. The type of information 
stored in the index depends upon the complexity of the 
indexing engine 300-1, which may analyze the contents of 
the electronic document and store the results of the analysis. 

A client computer, such as computer 20a, includes an OS 
browser function 35-1 (or separate browser application) that 
locates and displays documents to a user. When a user at the 
client computer desires to search for one or more electronic 
documents, the client computer transmits data to the search 
engine requesting a search. At that time, the search engine 
examines its associated index to find documents that may be 
desired by the user. The search engine may then return a list 
of documents to the browser 35-1. The user may then 
examine the list of documents and retrieve one or more 
desired electronic documents from remote computers. 

As will be readily understood, the system illustrated in 
FIG. 2 is exemplary, and alternative configurations may also 
be used in accordance with the invention. For example, the 
Web crawler program 200 and indexing engine and search 
engines 300 may reside on different computers. 
Additionally, the Web browser 35-1 and the Web crawler 
program 200 may reside on a single computer. Further, the 
indexing and search engines 300 are not required by the 
present invention. The Web crawler program 200 may 
retrieve electronic document information for uses other than 
providing the information to a search engine. As discussed 
above, the client computer(s) 20a-20c, server computers 
100-130, and remote Web site(s) 170 may communicate 
through any type of communications network or medium. 

Detecting Duplicate Documents Using Content Identifiers 

As mentioned above, one of the important and difficult 
problems that a crawler has to deal with is the duplication of 
documents. In the exemplary implementation of the 
invention, duplicates, also called replicas, are documents 
that have different URLs but the same physical storage 
location and thus the same content. It is important that the 
crawler detect such duplicate documents (assuming that 
creating an index is the primary application of the crawl). 
First, duplicate detection can be used to improve the 
performance of the indexing, since the duplicate documents do 
not have to be indexed twice. Second, duplicate detection 
can be used to improve the quality of search hits, since 
duplicate documents can be presented as one. 

There are two types of duplicate documents: exact duplicates 
in the same document store, and exact or inexact 
duplicates in different document stores. An example of the 
first type is a file on one Web server accessed through 
different virtual roots. A virtual root is a URL prefix associated 
with a file system directory on the Web server's 
computer. (For example, http://msw/hrstuff/hrweb/blahblah/policy.html 
and http://msw/hrweb/policy.html could point to 
the same physical file.) Another example is a mail message 
sent to all corporate employees. An example of the second 
type of duplicate document is a file copied to two different 
machines. 

A document store's ability to provide a property that 
uniquely identifies the document regardless of its URL may 
be employed in accordance with the present invention. 
Typical crawlers have no knowledge of the document store 
specifics. They cannot detect that two documents are duplicates 
other than by comparing some hash function calculated 
for both documents (e.g., calculating a Message Digest 5 
(MD5) function for the new document and comparing it with 
a previously calculated MD5 for the previously crawled 
document). A problem with using a hash function to detect 
duplicates is that it requires accessing the document and 
filtering it, which constitutes approximately half the time it 
takes to crawl a document (crawl minus indexing). This is 
particularly critical for crawling a mail system where a 
message is often sent to many mailboxes or cross-posted to 
different public folders or news groups. An example of a 
mail server that employs a form of content identifier (in this 
case a globally unique "single-instance identifier," or SID) is 
disclosed in U.S. Pat. No. 5,813,008, Sep. 22, 1998, "Single 
Instance Storage of Information." However, this patent does 
not disclose the use of a content identifier by an external 
application, such as a Web crawler, for the purpose of 
detecting duplicate documents. 

The performance of a crawl with respect to eliminating 
the work required to access duplicate documents could be 
greatly improved if the crawler could detect a duplicate 
before filtering the document. A solution to the problem can 
be provided if the document store supports a content identifier 
(CID) property or attribute for each document. In 
accordance with the present invention, the CID property can 
be fetched independently of the document itself and 
uniquely identifies the physical document. In other words, 
no two different documents would have equal CID 
properties, and the same document accessed through different 
URLs would return the same CID property. 

The CID corresponds to the document's physical storage. 
Multiple documents, i.e., documents with different URLs, 
could share the same physical storage space. This would be 
the case, for example, with mail sent to 1000 people, where 
the same file is accessed through different virtual roots of a 
Web server, file links in file system, etc. The stores will 
assign a CID based on where the document is physically 
stored. This aspect of the present invention provides a 
performance advantage for the crawls but does not guarantee 
that documents having identical content but stored in different 
physical locations will be detected as duplicates. 

According to the invention, the crawler (gatherer) fetches 
the CID property of the document first, looks in the History 
Table (or another table of CIDs), and, if it finds an existing 
CID of the same value, just notifies the indexing engine of 
the duplicate without filtering the document. In the case 
where the documents are gathered through notifications (the 
gatherer gets a notification with the URL of the document 
whenever the document gets modified, created or deleted), 
the CID property is passed by the notification source. This 
eliminates the need to connect to the server and fetch the 
CID property. 

If the document store does not support the CID property, 
the gatherer may use MD5-based duplicate detection, i.e., it 
may fall back to prior methods of duplicate detection. 

The data structure of an exemplary CID is shown in FIG. 
3. Preferably, the CID (like the SID described by U.S. Pat. 
No. 5,813,008) includes a globally unique identifier 
("GUID") that uniquely identifies the server that is creating 
the CID. The GUID is 16 bytes and includes four subparts: 
(1) a 60-bit system time; (2) a 4-bit version number; (3) a 
16-bit clock sequence 48; and (4) a 48-bit network address. 
An implementation of a process which generates GUID 
values as explained above can be obtained from Microsoft 
Corporation. The implementation resides in the Windows 
32-bit software development kit (WIN32SDK) as a program 
called UUIDGEN. Since the 16-byte GUID value is much 
larger than the actual number of servers in any given 
client/server computing system, the 16-byte GUID value can 
be compressed and stored locally in an abbreviated form. A 
CID also includes a local counter value, e.g., a six-byte 
count. The length of the counter value may be adjusted but 
should be sufficiently long to avoid a short term rollover 
problem. Rollover should be avoided in order to ensure 
unique CID values. In addition, it is desirable to avoid CID 
values that are the same as the MD5 values used by 
document stores that do not support the CIDs of the present 
invention. 

In sum, the present invention provides an improved 
web/document crawling method and system. An important 
feature of the preferred embodiments of the invention is the 
use of a CID property that can be easily provided by a 
document store to enhance the efficiency and usefulness of 
a crawler or like application. It is understood, however, that 
the invention is susceptible to various modifications and 
alternative constructions. It should be understood that there 

Referring now to FIG. 3, the dupUcate detection proce- is no intention to limit the invention to the specific con- 

dure begins at step S20, wherein, for a URL in the History stnictions described herein. On the contrary, the invention is 

Table 400-2, the crawler fetches the CID for that document intended to cover all modifications, alternative 

from the document store 140. In step 821, the crawler constructions, and equivalents failing within the scope and 

determines whether a CID having the same value as the one spirit of the invention. 

just obtained fmm the document store exists m the History ,j ^ ^^^^^ ^ .^^^^^^^ ^ 

S26 " T^' d performed; .f so, ^pieniented in a variety of database and database manage- 

step is per orme . mGUi applications, including electronic messaging systems 
In step 822, the document corresponding to the new URL mail severs. The various techniques described herein may 
IS fetched from the document store 140. In step 823, the be implemented in hardware or software, or a combination 
document is filtered- (Filtering means parsing the document of both. Preferably, the techniques are implemented in 
format to retrieve any useful information (text and computer programs executing on programmable computers 
properties) for crawling applications, such as mdexmg. This 45 that each include a processor, a storage medium readable by 
process also makes aU documents in aU different formats the processor (including volatile and non-volatile memory 
look the same to the crawler appUcation.) In step 824, the ^nd/or storage elements), at least one input device, and at 
URL and CID are committed to the History Table 400-2, and i^ast one output device. Program code is apphed to data 
then m step 825 the document is committed to the mdex entered using the input device to perform the fiinctions 
^^'^* 50 described above and to generate output information. The 
As indicated in the block for step S27, if the QD attribute output information is apphed to one or more output devices, 
is not supported by the document store 140, the crawler Each program is preferably implemented in a high level 
processes the document in accordance with a prior procedural or object oriented programming language to 
technique, i.e., by fetching the document, filtering it to communicate with a computer system. However, the pro- 
obtain a hash function (MD5), and committing the document 55 grams can be implemented in assembly or machine 
to an index if the hash function is not present in the History language, if desired. In any case, the language may be a 
Table (or a separate table associated with the History Table). compiled or interpreted language. Each such computer pro- 
Thus, a Web crawler application in accordance with the gram is preferably stored on a storage medium or device 
present invention takes advantage of a document store's (e.g., ROM or magnetic diskette) that is readable by a 
ability to provide a unique content identifier (CID) that is 60 general or special purpose programmable computer for 
indicative of the content/physical storage location of a data configuring and operating the computer when the storage 
object or document, such as a Web page. The CID data medium or device is read by the computer to perform the 
structure may be an extension of the globally unique iden- procedures described above. The system may also be con- 
tifier (GUID) described in the above -referenced U.S. Pat. sidered to be implemented as a computer- readable storage 
No. 5,813,008. For example, whereas the GUID is a 16-byte 65 medium, configured with a computer program, where the 
number, the CID of the present invention will preferably storage medium so configured causes a computer to operate 
comprise a 16-byte GUID plus an additional 6-byte number. in a specific and predefined manner. 
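
By way of illustration only, the duplicate detection procedure of
FIG. 3 and the 22-byte CID layout (a 16-byte GUID followed by a
six-byte counter) might be sketched as follows. This is a minimal
sketch under stated assumptions, not the patented implementation:
the names make_cid, DocumentStore, History and crawl_url are
hypothetical, the store is an in-memory stand-in, the storage slot
number stands in for the local counter, and the filtering step is
elided.

```python
import hashlib
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


def make_cid(server_guid: uuid.UUID, counter: int) -> bytes:
    """Pack a 16-byte GUID and a six-byte counter into a 22-byte CID."""
    if not 0 <= counter < 2 ** 48:
        raise ValueError("counter must fit in six bytes")
    return server_guid.bytes + counter.to_bytes(6, "big")


class DocumentStore:
    """Hypothetical in-memory store: several URLs may share one slot."""

    def __init__(self, supports_cid: bool = True) -> None:
        self.supports_cid = supports_cid
        self.guid = uuid.uuid1()            # time-based GUID for this server
        self.slots: Dict[int, bytes] = {}   # storage slot -> document bytes
        self.urls: Dict[str, int] = {}      # URL -> storage slot
        self.fetch_count = 0                # counts full document fetches

    def put(self, url: str, slot: int, content: Optional[bytes] = None) -> None:
        if content is not None:
            self.slots[slot] = content
        self.urls[url] = slot

    def get_cid(self, url: str) -> Optional[bytes]:
        """Return the CID without fetching the document, or None."""
        if not self.supports_cid:
            return None
        # Assumption: the slot number plays the role of the local counter.
        return make_cid(self.guid, self.urls[url])

    def fetch(self, url: str) -> bytes:
        self.fetch_count += 1
        return self.slots[self.urls[url]]


@dataclass
class History:
    """Stand-in for the History Table: URL -> key, plus the set of keys."""
    url_to_key: Dict[str, bytes] = field(default_factory=dict)
    known_keys: Set[bytes] = field(default_factory=set)


def crawl_url(url: str, store: DocumentStore, history: History,
              index: Dict[bytes, bytes]) -> bool:
    """Return True if the document was indexed, False if it was a duplicate."""
    cid = store.get_cid(url)                  # S20: try the CID first
    doc = None
    if cid is None:                           # S27: CID not supported
        doc = store.fetch(url)
        key = hashlib.md5(doc).digest()       # fall back to an MD5 digest
    else:
        key = cid
    is_new = key not in history.known_keys    # S21: look in the history
    if is_new:
        if doc is None:
            doc = store.fetch(url)            # S22: fetch only when new
        history.known_keys.add(key)           # S24: commit key to history
        index[key] = doc                      # S25: commit to the index
    history.url_to_key[url] = key             # the URL is recorded either way
    return is_new
```

In this sketch a second URL that resolves to the same storage slot
is recorded in the history without the document being fetched
again. Because the fallback MD5 digest is 16 bytes and the CID is
22 bytes, the two kinds of key cannot coincide here.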
We claim:

1. A computer-based method for use in crawling a 
computer-readable document store, and particularly for
detecting duplicate documents during a crawl so as to avoid
unnecessarily retrieving and processing such duplicates,
comprising the following acts: 

(a) obtaining from the document store a content identifier 
(CID) corresponding to a particular document, wherein 
the CID is characterized in that: (1) the CID can be 
fetched independently of the document itself, (2) the
CID uniquely identifies the physical document in that 
no two different documents would have equal CIDs, 
and (3) the same document accessible through different 
URLs would have the same CID; 

(b) determining whether the value of the CID is the same
as the value of a previously obtained CID correspond- 
ing to another document; and 

(c) if the value of the CID is not the same as the value of 
a previously obtained CID, fetching the particular 
document from the document store. 

2. A method as recited in claim 1, wherein the CID is a 
number that has a prescribed format and is globally unique. 

3. A method as recited in claim 2, wherein the CIDs of any 
two different documents will have different values. 

4. A method as recited in claim 3, wherein the CID is 
generated as a value which is a function of the physical 
storage location of the document. 

5. A method as recited in claim 4, wherein the CID of a 
document that is copied from a first storage location to a 
second storage location remains unchanged if the document 
is unmodified.

6. A method as recited in claim 1, wherein the CID is 
obtained from the document store by querying the document 
store with the address specifier of the particular document. 

7. A method as recited in claim 1, further comprising 
indexing the particular document after it has been fetched 
from the document store. 

8. A method as recited in claim 1, further comprising, if 
the value of the CID is the same as the value of a previously 
obtained CID, storing the address specifier of the particular 
document in a history table, without fetching the particular 
document from the document store. 

9. A method as recited in claim 1, wherein the method is 
executed by a server computer coupled by a network to the 
document store. 

10. A method as recited in claim 1, wherein the method is 
employed in connection with a Web crawler application. 

11. A method as recited in claim 1, wherein the method is 
employed in connection with a mail server application. 

12. A method as recited in claim 1, wherein the method is 
employed in connection with a directory service. 

13. A method as recited in claim 1, wherein the method is 
employed in connection with a system requiring indexing or 
one-way replication of data, to optimize replication by not 
copying duplicate data. 

14. A Web crawling method, comprising: 

providing a history table containing URLs of documents 
that have been indexed during a previous crawl, and 
content identifiers (CIDs) for such documents;

for a first URL encountered during an incremental crawl, 
fetching from a document store a CID for the document 
corresponding to the first URL;
determining whether a CID having the same value as the
one just obtained from the document store exists in the 
history table; 

if a CID having the same value is not present in the history 
table, performing the following acts: (1) fetching the 
document corresponding to the first URL from the 
document store; (2) committing the first URL and CID 
to the history table; and (3) committing the document 
corresponding to the first URL to an index; and 

if a CID having the same value is present in the history 
table, committing the first URL to the history table. 

15. A method as recited in claim 14, wherein the CID 
comprises a data structure that is an extension of a globally
unique identifier (GUID). 

16. A method as recited in claim 15, wherein the CID data 
structure includes (1) a 60-bit system time; (2) a 4-bit 
version number; (3) a 16-bit clock sequence 48; and (4) a 
48-bit network address; and (5) a local counter value. 

17. A method as recited in claim 16, wherein the local 
counter value is a six-byte number. 

18. A computer-readable storage medium containing com- 
puter executable code for instructing a computer to carry out 
the steps recited in claim 14. 

19. A computer system comprising: 
a server computer; 

a document store operatively coupled to the server 
computer, wherein the document store contains a plu- 
rality of electronic documents, and wherein the docu- 
ment store provides content identifiers (CIDs) for docu- 
ments in the document store, wherein the CID is 
characterized in that: (1) the CID can be fetched 
independently of the document itself, (2) the CID 
uniquely identifies the physical document in that no
two different documents would have equal CIDs, and 
(3) the same document accessible through different 
URLs would have the same CID; 

a computer readable storage medium operatively coupled 
to the server computer; and 

a computer-executable crawler application stored on the 
computer readable storage medium, wherein the 
crawler application is provided with the CIDs of 
selected documents on request. 

20. A system as recited in claim 19, wherein the crawler 
application, when executed by the server, causes the follow- 
ing acts to be carried out by the server: 

obtaining from the document store the CID corresponding 
to a particular document; 

determining whether the value of the CID is the same as 
the value of a previously obtained CID corresponding 
to another document; and 

if the value of the CID is not the same as the value of a 
previously obtained CID, fetching the particular docu- 
ment from the document store. 

21. A system as recited in claim 20, wherein the server 
computer comprises a member of a group consisting of a 
Web server, a mail server, a file server and a database server. 

22. A system as recited in claim 19, wherein each CID has 
a value which is a function of the physical storage location 
of the document to which it relates.